Lexical Access with a Statistically-Derived Phonetic Network
Abstract
A probabilistic approach to lexical access from a recognized phone sequence is presented. Lexical access is seen as finding the word sequence that maximizes the lexical likelihood of a sequence of phones and durations as recognized by a phone recognizer. This is theoretically correct for minimum error rate recognition within the model presented and is intuitively pleasing, since it means that the "confusion matrix" of the phone recognizer will be learned and its regularities exploited. The lexical likelihoods are estimated from training data provided by the phone recognizer using statistical decision trees. Classification trees are used to estimate the phone realization distributions, and regression trees are used to estimate the phone duration distributions. We find that they can effectively capture allophonic variation, alternative pronunciation, word co-articulation and segmental durations. We describe a simplified but efficient implementation of these models for lexical access in the DARPA resource management recognition task.

1. INTRODUCTION

We describe a new approach to lexical access in a phone-based speech recognition system. By "lexical access" we mean taking a sequence (or, more generally, a lattice) of phones and durations that is output by a phone recognizer and mapping it onto a word sequence (or, more generally, a lattice). In conventional word-based speech recognizers, segmental durations, word co-articulation and alternative pronunciations are usually poorly modelled, if at all, since the architecture is not convenient or efficient for exploiting these constraints. Phone-based recognition offers an attractive alternative from this point of view.

Our approach will be to create a probabilistic model that provides the likelihood that a particular word sequence gives rise to a particular phone sequence. This model will take into account allophonic variation, alternative pronunciation, word co-articulation and segmental durations.
We then combine these lexical likelihoods with the acoustic likelihoods generated by the phone recognizer and priors from our language model to get an overall recognition model whose error rate we seek to minimize. We have taken this stochastic approach for two reasons. First, it provides a principled way to combine seemingly disparate information: (a) acoustic likelihoods, (b) segmental durations, (c) alternative pronunciations, and (d) the language model. Second, the availability of large speech corpora now allows the statistical estimation of these probabilities.

2. PROBABILISTIC MODEL

We form the probabilistic model as follows. Let $w$ be a sequence of words, let $y$ be a sequence of phones, let $d$ be a sequence of durations, and let $s$ be a (fixed) speech signal. Then

$$P(w \mid s) \propto \sum_{y,d} P(s \mid y, d)\, P(y, d \mid w)\, P(w). \tag{2.1}$$

The left-hand side of this relation is the probability that a given speech signal corresponds to a particular word sequence. The word sequence that maximizes this term gives the minimum sentence error rate. The first factor on the right-hand side gives the acoustic likelihoods provided by the phone recognizer. The second factor gives the lexical likelihoods to be provided by the lexical access stage described here. The third factor represents whatever language model we use.

In this paper, we have used the output of the current Bell Labs phone recognizer as input to the lexical access component [1]. At present, this recognizer outputs a single sequence of phones and durations per utterance, which represents its best estimate of the true sequence. As such, $y$ and $d$ are fixed in Eq. 2.1 for a given speech signal. A more general approach, which would consider alternative sequences or phone lattices, is currently under investigation but is not reported here. Also in this paper, in which we present results on the DARPA resource management task, we consider only the simple word-pair language model.
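The simplification implied by these two restrictions can be written out explicitly. The following is a sketch rather than a derivation taken from the paper: $\hat{y}$ and $\hat{d}$ denote the single phone and duration sequence output by the recognizer, $\mathcal{L}$ denotes the set of word sequences that are legal under the word-pair grammar, and $P(w)$ is assumed to be constant over $\mathcal{L}$ and zero elsewhere. All three are notational assumptions introduced here.

```latex
\begin{align*}
P(w \mid s)
  &\propto \sum_{y,d} P(s \mid y, d)\, P(y, d \mid w)\, P(w)
  && \text{(Eq. 2.1)} \\
  &\rightarrow P(s \mid \hat{y}, \hat{d})\, P(\hat{y}, \hat{d} \mid w)\, P(w)
  && \text{(only the recognized pair $(\hat{y}, \hat{d})$ is kept)} \\
  &\propto P(\hat{y}, \hat{d} \mid w)\, P(w)
  && \text{($P(s \mid \hat{y}, \hat{d})$ does not depend on $w$)} \\[4pt]
\hat{w}
  &= \operatorname*{arg\,max}_{w \in \mathcal{L}} P(\hat{y}, \hat{d} \mid w)
  && \text{($P(w)$ constant on $\mathcal{L}$, zero elsewhere)}
\end{align*}
```

Under these assumptions the acoustic factor and the language model drop out of the maximization, leaving only the lexical likelihood and the legality constraint.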
Thus, for a given utterance, the best scoring word sequence, $w$, will be the one that maximizes the lexical likelihood, $P(y, d \mid w)$, for a given phone recognizer output $y$ and $d$, and which is a legal sequence in the word-pair grammar. In this model, finding the word sequence that maximizes this likelihood is the goal of lexical access, and estimating this likelihood is the goal of this paper. A crucial factor for this estimation is that $y$ and $d$ are not the true sequence of phones and durations, but the output of a phone recognizer. As such, we must train our estimator on the output of the phone recognizer. This is theoretically correct for minimum error rate recognition in this model and is intuitively pleasing since it means that the model will learn the "confusion ...
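The idea of training on recognizer output and then choosing the legal word sequence with the highest lexical likelihood can be illustrated with a small Python sketch. It is a toy under strong assumptions and not the paper's method: a context-free confusion table with add-one smoothing stands in for the decision trees, durations are ignored, intended and recognized phones are assumed to align one-to-one, and the lexicon, training pairs, word-pair grammar and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import product


def estimate_confusions(pairs, phone_set, alpha=1.0):
    """Estimate P(recognized | intended) from aligned phone pairs (add-alpha smoothing)."""
    counts = defaultdict(Counter)
    for intended, recognized in pairs:
        counts[intended][recognized] += 1
    probs = {}
    for intended in phone_set:
        total = sum(counts[intended].values()) + alpha * len(phone_set)
        probs[intended] = {r: (counts[intended][r] + alpha) / total for r in phone_set}
    return probs


def lexical_log_likelihood(words, observed, lexicon, probs):
    """log P(observed phones | words), assuming a one-to-one phone alignment."""
    intended = [phone for word in words for phone in lexicon[word]]
    if len(intended) != len(observed):
        return -math.inf  # toy model: no insertions or deletions
    return sum(math.log(probs[i][o]) for i, o in zip(intended, observed))


def is_legal(words, word_pairs):
    """True if every adjacent word pair is allowed by the word-pair grammar."""
    return all(pair in word_pairs for pair in zip(words, words[1:]))


def lexical_access(observed, lexicon, probs, word_pairs, max_words=3):
    """Exhaustively search legal word sequences for the best lexical likelihood."""
    best, best_score = None, -math.inf
    for n in range(1, max_words + 1):
        for words in product(lexicon, repeat=n):
            if not is_legal(words, word_pairs):
                continue
            score = lexical_log_likelihood(words, observed, lexicon, probs)
            if score > best_score:
                best, best_score = words, score
    return best, best_score


# Hypothetical toy data: this recognizer usually outputs "b" when "p" was intended.
phone_set = ["b", "i", "p", "s"]
pairs = ([("p", "b")] * 6 + [("p", "p")] * 2 + [("b", "b")] * 4
         + [("b", "p")] * 4 + [("i", "i")] * 8 + [("s", "s")] * 8)
lexicon = {"see": ["s", "i"], "pea": ["p", "i"], "bee": ["b", "i"]}
word_pairs = {("see", "pea"), ("see", "bee"), ("pea", "see"), ("bee", "see")}

probs = estimate_confusions(pairs, phone_set)
best, score = lexical_access(["s", "i", "b", "i"], lexicon, probs, word_pairs)
print(best)  # ('see', 'pea')
```

Although the recognizer reported "b", the learned table makes "pea" more likely than "bee" for this recognizer, which is the sense in which regularities of the confusion matrix are exploited. A practical system would replace the exhaustive search with a dynamic-programming search over the grammar.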
Similar resources
Production of English Lexical Stress by Persian EFL Learners
This study examines the phonetic properties of lexical stress in English produced by Persian speakers learning English as a foreign language. The four most reliable phonetic correlates of English lexical stress, namely fundamental frequency, duration, intensity, and vowel quality were measured across Persian speakers’ production of the stressed and unstressed syllables of five English disyllabi...
Interaction and representational integration: evidence from speech errors.
We examine the mechanisms that support interaction between lexical, phonological and phonetic processes during language production. Studies of the phonetics of speech errors have provided evidence that partially activated lexical and phonological representations influence phonetic processing. We examine how these interactive effects are modulated by lexical frequency. Previous research has demo...
A Combined Phonetic-Phonological Approach to Estimating Cross-Language Phoneme Similarity in an ASR Environment
This paper presents a fully automated linguistic approach to measuring distance between phonemes across languages. In this approach, a phoneme is represented by a feature matrix where feature categories are fixed, hierarchically related and binary-valued; feature categorization explicitly addresses allophonic variation and feature values are weighted based on their relative prominence derived f...
Phonetic Characterisation and Lexical Access in Non-segmental Speech Recognition
An isolated-word speech recognition system, built without the use of linear segments for acoustic modelling or lexical access, is justified, described and demonstrated. The system comprises phonetic feature analysis operating on four independent tiers, parallel phonotactic parsing, and lexical access based on a neural-network inspired lexicon structure. Performance is however still inferior to ...
Improving Spoken Dialogue Understanding Using Phonetic Mixture Model
Augmenting word tokens with a phonetic representation, derived from a dictionary, improves the performance of a Natural Language Understanding component that interprets speech recognizer output: we observed a 5% to 7% reduction in errors across a wide range of response return rates. The best performance comes from mixture models incorporating both word and phone features. Since the phonetic rep...
Improving Spoken Dialogue Understanding Using Phonetic Mixture Models
Reasoning about sound similarities improves the performance of a Natural Language Understanding component that interprets speech recognizer output: we observed a 5% to 7% reduction in errors when we augmented the word strings with a phonetic representation, derived from the words by means of a dictionary. The best performance comes from mixture models incorporating both word and phone features....